perf(iterator): -25% on dgraph-shaped scans via lockless isBanned, hoisted fill, internal-key skip, KeyOnly opt-in, user-key lastKey by shaunpatterson · Pull Request #2285 · dgraph-io/badger

shaunpatterson · 2026-05-23T20:38:50Z

Summary

Five targeted optimizations to the iterator hot path, measured against a new dgraph-shaped benchmark suite. Composite improvement: −25.4% vs upstream/main on the 5-bench composite (sum of ns/op medians, Apple M4 Max, -benchtime=5s, median-of-3).

Benchmark	Baseline	Final	Δ
`PrefixScanKeyOnly`	11.80M ns	8.15M ns	−31.0%
`PrefixScanKeyOnlyOpt` (new, opt-in)	—	7.58M ns	−35.8% vs baseline-equivalent
`PrefixScanAllVersions`	11.98M ns	10.55M ns	−11.9%
`KeyIteratorAllVersions`	1,514 ns	1,376 ns	−9.1%
`NamespaceOffset/off`	5.08M ns	3.78M ns	−25.7%
`NamespaceOffset/dgraph`	5.51M ns	3.80M ns	−31.0%
Composite	34.37M	25.71M	−25.2%

What changed

Six commits behind the perf series + 2 doc/test commits:

bench — Adds iterator_dgraph_bench_test.go with 5 benches that model dgraph's iteration patterns (NamespaceOffset=1, NumVersionsToKeep=MaxInt32, DetectConflicts=false). These are the metric for every iteration below.
perf: lockless fast path for DB.isBanned when no namespaces banned — Adds monotonic atomic.Bool hasAny on lockedKeys. isBanned returns on a single atomic load when no namespaces are banned (the production steady state for tenants who enable NamespaceOffset for isolation but rarely ban). Eliminates RWMutex.RLock per iterator step. Sets the flag inside the same lock that guards the map, so any reader observing hasAny=true is guaranteed to see the populated map via release-acquire on the lock.
perf: hoist mi.Value() and key out of fill() — fill(item) → fill(item, key, *y.ValueStruct). The KeyOnly path's parseItem previously made two mi.Value()+Decode calls per kept item. vs is passed by pointer to avoid copying the ~40-byte ValueStruct on every kept item.
perf: skip per-step internal-key probe when prefix excludes badgerPrefix — Precompute canSeeInternalKeys once at iterator construction. When the user's opt.Prefix exists and its first byte differs from badgerPrefix[0], no key the iterator lands on can be a badger-internal key, so the per-step bytes.HasPrefix(key, badgerPrefix) probe is dead code.
perf: IteratorOptions.KeyOnly skip per-item value SafeCopy — New opt-in IteratorOptions.KeyOnly. When set, fill skips the per-item SafeCopy(item.vptr, vs.Value). Item.Value / Item.ValueCopy return a new ErrKeyOnlyMode (strict-error, not silent-empty). PrefetchValues is forced off when KeyOnly=true. Captures an additional ~2–4% on top of the implicit gains for callers (e.g. dgraph's has() evaluator / index scans) that already discard item.vptr.
perf: store user-key only in lastKey, inline same-key compare — Iterator.lastKey previously held the full key (user-key‖ts). Stores user-key only now. Replaces y.SameKey(lastKey, key) with an inlined len(key)-8 == len(lastKey) && bytes.Equal(key[:ukLen], lastKey). Avoids one ParseKey per SameKey and an 8-byte memcpy on each lastKey update. Same-key dedup invariant preserved exactly.
docs(iterator): clarify FILL key-freshness and ukLen invariants — Strengthened comments on two hot-path invariants surfaced during code review.
test: cover KeyOnly iterator paths and preserved-behavior regressions — New iterator_keyonly_test.go. Together with the benchmarks, every touched line in db.go and iterator.go is now at 100% coverage (verified via go tool cover).

Correctness

Every commit preserves existing semantics:

iter1: add() stores the flag inside the same lock that guards the map → release-acquire visibility holds.
iter2: key at every fill() call site is the current mi.Key(); the only goto FILL (reverse path) refreshes it after mi.Next().
iter3: When canSeeInternalKeys=false, isInternalKey stays false, which is the correct value for prefix shapes that cannot overlap badgerPrefix.
iter4: Misuse is loud — Item.Value returns ErrKeyOnlyMode rather than a silent empty slice. PrefetchValues is forced off so the prefetch goroutine never races fill.
iter5: lastKey user-key-only representation is preserved across the existing "b 7 (del), b 5" hazard because lastKey is still updated before the deleted-or-expired check.

Tests

All preserved-behavior tests (TestPreserved* / TestRegression*) pass on upstream/main AND on this branch — they were validated on both via a worktree.
New-API tests (TestKeyOnlyIterator_*, TestCanSeeInternalKeys) pass on this branch.
Full test suite (go test -run='^Test' ./...): PASS in 234.7s.
100% line coverage on every line modified across db.go (isBanned + lockedKeys.add) and iterator.go (Item.Value/ValueCopy/EstimatedSize/ValueSize, NewIterator, canSeeInternalKeys, fill, parseItem).

Test plan

go test ./... — full suite passes
go test -bench=BenchmarkDgraph -benchtime=5s -count=3 . — composite −25.2% vs main
go tool cover -func=... — 100% on touched lines
Worktree validation: preserved-behavior tests also pass on upstream/main
CI runs (this PR will trigger them)

Notes for reviewers

IteratorOptions.KeyOnly is opt-in; existing callers see no behavior change. Dgraph would need a one-line change at its has() / index-scan iterator construction sites to capture the additional 2–4% on those paths.
The composite metric is sum-of-medians for easy single-number comparison; the largest absolute benchmark (PrefixScanAllVersions) dominates and hides smaller relative wins on KeyIteratorAllVersions. Per-benchmark breakdown above tells the full story.
No public API removals or breaking changes. The only public additions are IteratorOptions.KeyOnly (bool, default false) and ErrKeyOnlyMode (sentinel error).

🤖 Generated with Claude Code

Add four benchmarks modeling dgraph's posting-layer iterator hot paths. dgraph drives badger in a specific combination of options that none of the existing root-package benchmarks exercise: - NamespaceOffset = 1 (DB.isBanned runs on every key) - NumVersionsToKeep = math.MaxInt32 (all versions retained) - DetectConflicts = false (dgraph owns OCC) - PrefetchValues = false (key-only scans in every hot path) - AllVersions = true (MVCC-aware reads) - Prefix-bounded forward iteration over mid-sized structured keys Keys mirror dgraph's data-key layout (1B type + 8B namespace + 2B attr-len + ~20B attr + 1B subtype + 8B UID = 40 bytes). Benchmarks: - BenchmarkDgraphPrefixScanKeyOnly: models has() / index-scan paths (posting.MemoryLayer.IterateDisk, worker.sort.go) - BenchmarkDgraphPrefixScanAllVersions: models the rollup iterator with multiple MVCC versions per key - BenchmarkDgraphKeyIteratorAllVersions: models readFromDisk via NewKeyIterator over a single posting key (every cache miss) - BenchmarkDgraphPrefixScanNamespaceOffset: isolates the cost of NamespaceOffset=1 vs -1 (i.e. the per-key DB.isBanned tax) These benchmarks are intended as the measurement baseline for upcoming iterator hot-path optimizations.

DB.isBanned runs on every iterator step and every key access when NamespaceOffset >= 0. The hot path acquires lockedKeys.RLock and probes the keys map even when the ban set is empty (the steady-state case in production deployments like dgraph that enable NamespaceOffset for multi-tenant key isolation but rarely ban anything). Add a monotonic atomic.Bool flag (lockedKeys.hasAny) that flips from false to true on the first add() and never flips back. isBanned can check this flag with a single atomic load and skip the lock + map lookup when no ban has ever been recorded. Correctness: add() stores the flag inside the same lock that guards the keys map, so any reader observing hasAny=true is guaranteed to see the populated map via release-acquire ordering on the lock. Readers observing hasAny=false correctly return early because the flag is monotonic and is set before any key is observable in the map. DB-reopen replay path also calls add() so reopened DBs retain the flag for previously-recorded bans. Measured on dgraph-shaped iterator benchmarks (Apple M4 Max, median-of-3, -benchtime=5s): PrefixScanKeyOnly 11.80M -> 10.56M ns/op (-10.5%) PrefixScanAllVersions 11.98M -> 10.60M ns/op (-11.4%) KeyIteratorAllVersions 1,514 -> 1,371 ns/op (-9.4%) NamespaceOffset/dgraph 5.51M -> 4.96M ns/op (-10.0%) NamespaceOffset/off 5.08M -> 5.01M ns/op (-1.4%, noise) The NamespaceOffset/dgraph case now matches NamespaceOffset/off (both ~4.96-5.01M ns/op) -- the entire ~8.5% NamespaceOffset tax from the per-key RLock is eliminated. Tests: full suite passes (TestBannedPrefixes, TestIterateWithBanned verified) except TestProtosRegenerate which fails identically on baseline due to missing protoc binary. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…decode Iterator.parseItem previously called mi.Value() twice per kept item on the non-AllVersions hot path: 1. To pre-check Meta/ExpiresAt for the deleted/expired filter. 2. Inside fill(), which re-fetched the same ValueStruct (cost includes an Arena getVal + ValueStruct.Decode that the CPU profile showed consuming ~40% of cumulative samples on dgraph-shaped key-only prefix scans). Refactor fill to take the already-fetched key []byte and a pointer to the already-decoded *y.ValueStruct, and hoist mi.Value() into both the AllVersions and non-AllVersions branches of parseItem. fill no longer touches it.iitr. Pass vs by pointer (not value): y.ValueStruct is ~40 bytes; passing by value added a per-item struct copy that wiped out the AllVersions win (branch only had one Value() call, no double-decode to eliminate). Pointer collapses to a single 8-byte arg; fill does not store the pointer so escape analysis keeps vs on the stack. Correctness: - vs and key are fetched while the iterator is positioned on the current entry; no mi.Next() runs between fetch and fill. Arena/block memory backing vs.Value and key is stable for that window. - In the reverse-iteration goto FILL path, mi has advanced once, so the refactor refreshes key = mi.Key() before re-entering FILL, and FILL re-fetches vs := mi.Value(). Semantically equivalent to the pre-refactor code. Measured on dgraph-shaped iterator benchmarks (Apple M4 Max, median-of-3, -benchtime=5s) -- composite (sum of 5 medians) drops 31.13M -> 26.37M ns (-15.3% vs iter1, -23.3% vs upstream baseline): PrefixScanKeyOnly 10.56M -> 8.18M ns/op (-22.5%) PrefixScanAllVersions 10.60M -> 10.53M ns/op (-0.7%, flat) KeyIteratorAllVersions 1,371 -> 1,376 ns/op (+0.4%, noise) NamespaceOffset/off 5.01M -> 3.83M ns/op (-23.6%) NamespaceOffset/dgraph 4.96M -> 3.83M ns/op (-22.7%) Tests: full suite passes (TestProtosRegenerate continues to fail identically due to missing protoc binary). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Iterator.parseItem unconditionally ran bytes.HasPrefix(key, badgerPrefix) on every step to detect badger-internal keys (e.g. !badger!banned). For prefix-bounded scans whose Prefix's first byte differs from badgerPrefix[0] ('!'), no key the iterator can land on can possibly be internal -- the probe is dead. The CPU profile after iter2 attributed ~13% of cumulative samples to runtime.memequal, a portion of which was this probe. Precompute canSeeInternalKeys once at iterator construction: - empty prefix: true (unbounded scan can hit internal keys) - prefix[0] == '!' : true (could be the banned-ns recovery iterator or any scan that starts in the internal-key range) - otherwise: false (the iterator after Seek(prefix) is positioned past the internal-key range and hasPrefix() will reject any !-prefixed key anyway, so the per-step probe is provably dead). When the flag is false, skip both the HasPrefix probe and the !InternalAccess skip-block. isInternalKey defaults to false, which is also the correct value for the downstream "!isInternalKey && isBanned" gate (non-internal keys are subject to ban checks; internal ones are not, but here we know they can't appear). Correctness: - initBannedNamespaces sets Prefix = bannedNsKey (starts with '!') so the flag is true and the probe still runs on the recovery path. - All other badger-internal access uses InternalAccess=true with a badgerPrefix-anchored Prefix, so the flag is also true there. - For dgraph's data scans (Prefix[0] = 0x00) and all other user scans the probe is provably dead and can be skipped. Measured on dgraph-shaped iterator benchmarks (Apple M4 Max, median-of-3, -benchtime=5s): PrefixScanKeyOnly 8.18M -> 7.92M ns/op (-3.2%) PrefixScanAllVersions 10.53M ->10.50M ns/op (-0.2%, noise) KeyIteratorAllVersions 1,376 -> 1,350 ns/op (-1.9%) NamespaceOffset/off 3.83M -> 3.72M ns/op (-2.6%) NamespaceOffset/dgraph 3.83M -> 3.74M ns/op (-2.4%) Cumulative since upstream baseline: composite 34.37M -> 25.89M ns (-24.7%); PrefixScanKeyOnly 11.80M -> 7.92M (-32.9%); NamespaceOffset/dgraph 5.51M -> 3.74M (-32.1%). Tests: TestBannedPrefixes, TestIterateWithBanned, TestIterator*, and TestKeyIterator* all pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Add opt-in KeyOnly flag to IteratorOptions. When set, fill() skips the per-item SafeCopy of vs.Value into item.vptr, which is the largest remaining memmove on the key-only forward-scan hot path (dgraph has()/index scans). Trade-offs (Option A, strict): - Item.Value / Item.ValueCopy return new ErrKeyOnlyMode. - Item.ValueSize / Item.EstimatedSize report 0 / key-size only. - Item.Key, Version, UserMeta, ExpiresAt, IsDeletedOrExpired unaffected. - NewIterator forces PrefetchValues=false when KeyOnly=true so the prefetch goroutine never starts. Existing callers see no behavior change. The new benchmark BenchmarkDgraphPrefixScanKeyOnlyOpt exercises the opt-in path alongside the original PrefixScanKeyOnly to preserve comparison. Results (5-bench composite, median-of-3, Apple M4 Max): - PrefixScanKeyOnlyOpt: 7.76M ns/op (−2.0% vs iter3 KeyOnly, −34.3% vs baseline) - Composite with opt-in: 25.63M ns (−25.4% vs upstream baseline) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Iterator.lastKey previously stored the full key (user-key || 8B ts) but SameKey only compared the user-key portion. Two consequences: 1. SafeCopy(it.lastKey, key) copied 8 bytes of dead ts every kept item. 2. y.SameKey called ParseKey on both src and dst (two slice ops + a function call) on the hot dedup compare. Change: store the user-key only in lastKey; inline the compare. ukLen := len(key) - 8 if ukLen == len(it.lastKey) && bytes.Equal(key[:ukLen], it.lastKey) { mi.Next() return false } it.lastKey = y.SafeCopy(it.lastKey, key[:ukLen]) The same-key dedup invariant ("skip later versions of a user-key we've already considered, including tombstoned ones") is preserved exactly: lastKey is still updated before the FILL/deleted-or-expired check. Result: BenchmarkDgraphPrefixScanKeyOnlyOpt: 7.76M -> 7.58M ns (-2.3%). Other benchmarks within run-to-run noise. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Strengthen comments on two hot-path invariants surfaced during code review of the perf series: 1. FILL label entry contract: `key` is always mi.Key() at the current iitr position. The only goto FILL (reverse path) refreshes key after mi.Next(); the fall-through entry above never advances the iterator between `key := mi.Key()` and reaching FILL. This is what lets fill() safely reuse the caller-supplied key without re-calling mi.Key() per item. 2. lastKey ukLen safety: `ukLen := len(key) - 8` is safe because len(key) >= 8 is a badger-wide invariant (every LSM key is stored with an 8-byte timestamp suffix via y.KeyWithTs). y.ParseTs(key) above already relies on this — no defensive check needed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds a comprehensive test file that hits every line of the perf series touched in db.go and iterator.go. Two test groups: * TestKeyOnlyIterator_* — exercise the new KeyOnly opt-in path: Value/ValueCopy return ErrKeyOnlyMode; EstimatedSize returns key length; ValueSize returns 0; NewIterator forces PrefetchValues=false; Key/UserMeta/Version/ExpiresAt continue to work under KeyOnly. * TestRegression* — lock in pre-existing behaviors that the perf series rewrote in place: same-key dedup across multiple MVCC versions (iter5 lastKey rewrite), internal-key hiding (iter3 canSeeInternalKeys gate), isBanned correctness across the new atomic.Bool fast path (iter1) including short-key-with-bans branch, and non-KeyOnly fill path correctness (iter2 hoist). * TestCanSeeInternalKeys — table-driven unit test for the helper added in iter3. All preserved-behavior tests also pass on upstream/main, proving the expectations match the unchanged baseline. Together with the benchmarks added in 7174cc2, the touched lines in db.go and iterator.go now have 100% coverage. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

shaunpatterson and others added 8 commits May 23, 2026 14:30

shaunpatterson requested a review from a team as a code owner May 23, 2026 20:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(iterator): -25% on dgraph-shaped scans via lockless isBanned, hoisted fill, internal-key skip, KeyOnly opt-in, user-key lastKey#2285

perf(iterator): -25% on dgraph-shaped scans via lockless isBanned, hoisted fill, internal-key skip, KeyOnly opt-in, user-key lastKey#2285
shaunpatterson wants to merge 8 commits into
dgraph-io:mainfrom
shaunpatterson:perf/dgraph-iter-bench

shaunpatterson commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant

Conversation

shaunpatterson commented May 23, 2026

Summary

What changed

Correctness

Tests

Test plan

Notes for reviewers

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant